{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"text-align: right\" align=\"right\"><i>Peter Norvig, Oct 2022</i></div>\n",
    "\n",
    "# Generative Problem Solving with Large Language Models\n",
    "\n",
    "Large language models have recently shown an ability to solve a variety of problems. In this notebook we consider programming problems (as solved by [AlphaCode](https://www.deepmind.com/blog/competitive-programming-with-alphacode)) and mathematics problems (as solved by [Minerva](https://minerva-demo.github.io/#category=Geometry&index=6)). The questions we would like to get at are: \n",
    "- In the future, what role will these generative models play in assisting a programmer or mathematician? \n",
    "- What will be a workflow that incorporates these models? \n",
    "- How will other existing tools (such as programming languages) change to accomodate this workflow?\n",
    "\n",
    "# AlphaCode solving the \"D.Backspace\" problem\n",
    "\n",
    "We start with the main example from DeepMind's [blog post](https://www.deepmind.com/blog/competitive-programming-with-alphacode), which describes how **AlphaCode** reads the following English-language description of a programming contest problem, and generates the following Python program:\n",
    "\n",
    "![](backspace.svg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is very impressive indeed that AlphaCode  comes up with a program that  correctly solves the problem.\n",
    "\n",
    "Still, the code could be improved. Here is my quick code review (in red) pointing out ten issues: \n",
    "\n",
    "![](alphacode-backspace.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I find it problematic that AlphaCode dredges up relevant code fragments from its training data, without fully understanding the reasoning for the fragments. We see that with the reversal of the lists and then the inefficient `b.pop(0)`. This may be a fault in the training data: perhaps many programmers contributed inefficient code to the training set, and AlphaCode is unable to distinguish efficient from inefficient. But worse is the introduction of list `c`, which is never used in the program. It seems AlphaCode has lots of training fragments in which one pops an item off a stack and stores it someplace else. So AlphaCode decided to do that in this program, without recognizing that it is not necessary in this case. \n",
    "\n",
    "The biggest issue is that there is no explanation of *why* the code is the way it is, no justification for *how* it works. Four questions stand out:\n",
    "1. Why are the input strings converted to lists? \n",
    "2. Why are the lists reversed?\n",
    "3. The program always accepts a source character when it matches the expected target character. Does this work with all inputs?\n",
    "4. What about hitting backspace two times in a row?\n",
    "\n",
    "I would like to see answers like this in documentation accompanying the code:\n",
    "1. It is more efficient to do `b.pop()` to remove the last character from list `b`, than to do `b = b[:-1]`, which achieves the same effect, but does so by creating a new object that copies all but the last character (whether `b` is a string or a list). The AlphaCode program seems to partially understand this, but then spoils it all by doing `b.pop(0)`, which is just as inefficient as `b = b[:-1]`.\n",
    "2. If we scan through the source string left-to-right, we can't tell (without looking ahead) whether we should backspace the first character or accept it. If we scan right-to-left and the final character of the source *does not* match the final character of the target, we know we *must* backspace the character, because there will be no other way to get rid of it. Therefore scanning right-to-left is better. However, it is not necessary (or advisable) to reverse the inputs to achieve right-to-left scanning. (If you feel you must reverse them, use a `deque` and the `popleft` method.)\n",
    "3. If we scan right-to-left and the final characters *do* match, it is indeed safe to always accept that match. We don't need to consider the possibility of deleting this character, and letting another instance of the character earlier in the source take its place. To see why, consider two example cases:\n",
    "  - `backspacer('abc...c', 'abc') ⇒ True`: Here there are an odd number of non-`c` characters between the two instances of `c`. That means that we can accept the rightmost `c` and backspace over the odd number of characters and the other `c`, and then accept the `ab` to complete the match.  Accepting either the rightmost `c` or the earlier one has the same effect, so we might as well accept the rightmost one. \n",
    "  - `backspacer('abc..c', 'abc') ⇒ False`: Here there are an even number of non-`c` characters between the two instances of `c`. If we accept the rightmost `c`, we'll need to backspace an even number of characters, which means either that we remove the `b` or that we accept both `c` characters. Either way we fail to match. If we backspace the rightmost `c`, we have the same problem: either we have to backspace the other `c`, or we need to accept the `.` to the right of the earlier `c`. Either way we fail to match. Accepting either the rightmost `c` or the earlier one has the same effect, so we might as well accept the rightmost one. \n",
    "4. When the right-to-left scan chooses to backspace, it immediately removes the current and previous character. That is, it chooses to type the character to the left of the current character, and then delete it by typing backspace instead of the current character. So it is impossible, in a right-to-left scan as implemented here, to type two backspaces in a row. Fortunately that's not a problem, because it is possible to achieve the same effect. Suppose the source string is `\"abcde\"`. Typing two backspaces at the end, `\"abc⬅︎⬅︎\"` results in the output `\"a\"`. But choosing backspace twice in a right-to-left scan gives us `\"ab⬅︎d⬅︎\"`, which also results in `\"a\"`. So we might as well choose to backspace every other character, because the effect is the same.\n",
    "\n",
    "\n",
    "If generative programming models are to be useful to professional programmers (or to amateurs with a programming task), it will not be enough to generate code that happens to pass a few example cases. They will need to participate in a conversation that leads to the kind of questions and answers we discussed here, thereby building trust in the program. It might be that the generative models produce all the questions and  answers automatically, or they might engage in a conversation with the human."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Backspacer programs, with test results\n",
    "\n",
    "Below is the AlphaCode version of the program, which I have changed as little as possible while refactoring it into two functions: `backspacer_alphacode`, which handles a single (source, target) pair, and `backspacer_driver`, which reads inputs and calls `backspacer_alphacode`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "def backspacer_alphacode(s, t):\n",
    "    a=[]\n",
    "    b=[]\n",
    "    for j in s:\n",
    "        a.append(j)\n",
    "    for j in t:\n",
    "        b.append(j)\n",
    "    a.reverse()\n",
    "    b.reverse()\n",
    "    c=[]\n",
    "    while len(b)!=0 and len(a)!=0:\n",
    "        if a[0]==b[0]:\n",
    "            c.append(b.pop(0))\n",
    "            a.pop(0)\n",
    "        elif a[0]!=b[0] and len(a)!=1:\n",
    "            a.pop(0)\n",
    "            a.pop(0)\n",
    "        elif a[0]!=b[0] and len(a)==1:\n",
    "            a.pop(0)\n",
    "    if len(b)==0:\n",
    "        return True\n",
    "    else:\n",
    "        return False\n",
    "\n",
    "def backspacer_driver(backspacer=backspacer_alphacode):\n",
    "    for _ in range(int(input())):\n",
    "        s, t = input(), input()\n",
    "        if backspacer(s, t):\n",
    "            print(\"YES\")\n",
    "        else:\n",
    "            print(\"NO\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below are two versions that I would prefer over AlphaCode's version. I prefer them because they are more idiomatic Python, they include function signature type hints, they include docstrings, and they are shorter, clearer, and more efficient than the AlphaCode version. (Let me also say that, because the problem description references the variable names *s* and *t*, it is  acceptable to use those names in the program. But it is usually poor style to use single-letter names for the parameters of a function, so I  prefer the more descriptive names *source* and *target*, which happen to start with the same letters.)\n",
    "\n",
    "The first version scans the strings right-to-left, and on each iteration either deletes one character from the end of both source and target if they match, or removes two characters from the end of source (the effect of a backspace):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def backspacer_strings(source: str, target: str) -> bool:\n",
    "    \"\"\"Can you obtain the string `target`, if you type the string `source` and\n",
    "    press \"Backspace\" instead of typing several (maybe zero) characters of `source`?\n",
    "    Scan right-to-left, if chars match, remove from both source and target.\n",
    "    If not, remove two chars from source. If source ever ends with remaining target,\n",
    "    succeed (because we can delete any prefix of source); else fail if we run out of source.\"\"\"\n",
    "    while source and not source.endswith(target):\n",
    "        if source[-1] == target[-1]:\n",
    "            source, target = source[:-1], target[:-1] # Match end characters\n",
    "        else:\n",
    "            source = source[:-2]                      # Backspace this character and previous one\n",
    "    return source.endswith(target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The second version has the same basic algorithm, but operates on lists, taking advantage of the efficient `pop` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def backspacer_lists(source: str, target: str) -> bool:\n",
    "    \"\"\"Can you obtain the string `target`, if you type the string `source` and\n",
    "    press \"Backspace\" instead of typing several (maybe zero) characters of `source`?\n",
    "    (Like `backspacer_strings`, but with strings converted to lists for efficiency.)\"\"\"\n",
    "    s = list(source)\n",
    "    t = list(target)\n",
    "    while s and t:\n",
    "        ch = s.pop()\n",
    "        if ch == t[-1]:\n",
    "            t.pop()  # Match end characters\n",
    "        elif s:\n",
    "            s.pop()  # Backspace previous character, if there is one\n",
    "    return not t"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, below is a third version.  It has a very straightforward approach: generate all  possible output strings for the source, and then check if the target its one of the possible outputs. To generate output strings, scan left-to-right, and for every character consider both the possibility of typing the character and of replacing it by backspace, maintaining a set of all possible outputs at each step.\n",
    "\n",
    "If this version of the program gives the same results as a more complicated version, that gives us added confidence that all the assumptions made by the complicated version are valid, because this version does not make those assumptions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def backspacer_slow(source: str, target: str) -> bool:\n",
    "    \"\"\"Can you obtain the string `target`, if you type the string `source` and \n",
    "    press \"Backspace\" instead of typing several (maybe zero) characters of `source`?\n",
    "    Slow but clear algorithm: Check if `target` is one of the possible outputs of `source`.\"\"\"\n",
    "    return target in possible_outputs(source)\n",
    "\n",
    "def possible_outputs(source: str) -> set:\n",
    "    \"\"\"All outputs that can be generated by replacing characters in `source` with `backspace`.\"\"\"\n",
    "    outputs = {''}\n",
    "    for c in source:\n",
    "        outputs = union({out + c, out[:-1]} for out in outputs)\n",
    "    return outputs\n",
    "\n",
    "def union(sets) -> set: return set().union(*sets)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I would never want to use this in production, because it is extremely  inefficient; the number of possible outputs grows exponentially. \n",
    "\n",
    "> An aside, which is not pertinent to solving the original problem, but is interesting in its own right: since each character can either be accepted or replaced by a backspace, you might expect that the number of possible outputs is *O*(2<sup>*n*</sup>). But different patterns of backspaces result in the same output, so you'd expect a bit less. It turns out the number of outputs grows as the Fibonacci sequence, that is *O*(1.618<sup>*n*</sup>): if all the source characters are different, the number of outputs of length *n* is equal to the number of outputs of length *n* - 1 (to which we add a character) plus the number of outputs of length *n* - 2 (which we get by deleting characters *n* and *n* - 1). If some of the characters in the source are repeated, the number of outputs will be less."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1 2\n",
      "2 3\n",
      "3 5\n",
      "4 8\n",
      "5 13\n",
      "6 21\n",
      "7 34\n",
      "8 55\n",
      "9 89\n"
     ]
    }
   ],
   "source": [
    "# Fibonacci numbers\n",
    "for i in range(1, 10):\n",
    "    source = '123456789'[:i]\n",
    "    print(len(source), len(possible_outputs(source)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Verification and Timing Results\n",
    "\n",
    "To gain trust in a program we need a thorough test suite.  A good programming assistant should help produce a test suite like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def tests(backspacer, n=6000):\n",
    "    \"\"\"Test a backspacer function on a variety of (source, target) pairs.\"\"\"\n",
    "    assert backspacer('ababa', 'ba'),           \"example 1 from problem statement\"\n",
    "    assert backspacer('ababa', 'bb') is False,  \"example 2 from problem statement\"\n",
    "    assert backspacer('aaa', 'aaaa') is False,  \"example 3 from problem statement\"\n",
    "    assert backspacer('aababa', 'ababa'),       \"example 4 from problem statement\"\n",
    "    assert backspacer('abc...c', 'abc'),        \"odd part of right-to-left proof\"\n",
    "    assert backspacer('abc..c', 'abc') is False,\"even part of right-to-left proof\"\n",
    "    assert backspacer('ab..c', 'abc'),          \"delete 2 characters from middle\"\n",
    "    assert backspacer('ab.c', 'abc') is False,  \"can't delete one character from middle\"\n",
    "    assert backspacer('abc.', 'abc') is False,  \"can't delete one character from end\"\n",
    "    assert backspacer('.abc', 'abc'),           \"can delete one character from start\"\n",
    "    assert backspacer('', ''),                  \"can always obtain empty string t\"\n",
    "    assert backspacer('a', ''),                 \"can always obtain empty string t\"\n",
    "    assert backspacer('ab', ''),                \"can always obtain empty string t\"\n",
    "    assert backspacer('a', 'b') is False,       \"can't obtain novel character\"\n",
    "    assert backspacer('', 'a') is False,        \"can't get something from nothing\"\n",
    "    assert backspacer('ab.ab', 'ab'),           \"don't choose leftmost ab\"\n",
    "    assert backspacer('ab.ab.', 'ab'),          \"must choose leftmost ab\"\n",
    "    \n",
    "    t = n * 'abcba'                             # Long target string (5n characters)\n",
    "    assert backspacer(t, t),                    \"long source and target (no backspaces needed)\"\n",
    "    assert backspacer('a' + t, t),              \"long source and target (one backspace needed)\"\n",
    "    assert backspacer(t, t + 'a') is False,     \"long source and target (no solution possible)\"\n",
    "    assert backspacer(t + 'a', t) is False,     \"long source and target (no solution possible)\"\n",
    "    assert backspacer('ab'.join(t), t),         \"longer source (10n-2 backspaces needed)\"\n",
    "    \n",
    "    return True"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All four of our backspacer functions pass all the tests. \n",
    "\n",
    "Here are the timing results (\"s\" means seconds and \"ms\" means milliseconds: 1/1000 second): "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 1.44 s, sys: 7.35 ms, total: 1.45 s\n",
      "Wall time: 1.45 s\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%time tests(backspacer_alphacode)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 164 ms, sys: 2.5 ms, total: 166 ms\n",
      "Wall time: 165 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%time tests(backspacer_strings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 29.3 ms, sys: 1.07 ms, total: 30.4 ms\n",
      "Wall time: 29.9 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%time tests(backspacer_lists)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 423 ms, sys: 31.9 ms, total: 455 ms\n",
      "Wall time: 453 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%time tests(backspacer_slow, n=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that:\n",
    "- `backpacker_strings` is about 9 times faster than `backspacer_alphacode` on this test suite. \n",
    "- `backspacer_lists` is about 50 times faster than `backspacer_alphacode` on this test suite. \n",
    "- `backspacer_slow` is ridiculously slow, even with `n=2` (and thus up to 30-character strings) rather than `n=6000` (up to 90,000-character strings).\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Minerva solving geometry problem #6\n",
    "\n",
    "Now we switch to the [Minerva model](https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html) solving [a geometry problem](https://minerva-demo.github.io/#category=Geometry&index=6). It outputs both a final answer and step-by-step reasoning that attempts to justify the answer. As we can see below, the final answer is correct:\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**QUESTION**: A triangle has sides measuring 1 unit and 3 units. The length of the third side is an integer value of units. What is the length of the third side, in units?\n",
    "\n",
    "**MODEL ANSWER**: Since the sum of the lengths of any two sides of a triangle must be greater than the length of the third side, we must have 1+3 > *x* and 1+*x* > 3 and *x*+3 > 1, which gives *x* > −2 and *x* > 2 and *x* > −2. Since *x* is an integer, this gives *x*≥3. Since the sum of any two sides of a triangle must be less than the third side, we must also have 1+3 < *x* and 1+*x* < 3 and *x*+3 < 1, which gives *x*<4 and *x* < 2 and *x* < −2. Therefore, we must have *x* = 3. Final Answer: The final answer is 3.\n",
    "\n",
    "**REFERENCE ANSWER**: Call the length of the third side *n*. By the triangle inequality, 1+3 > *n* and 1+*n* > 3, or 2 < *n* < 4. The only integer *n* which satisfies this is 3. Final Answer: The final answer is 3\n",
    "\n",
    "**PROBLEM SOURCE**: MATH Geometry Level 2\n",
    "\n",
    "**EVALUATION RESULT**: Correct"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Unfortunately, the reasoning is not correct. The statement \"*the sum of the lengths of any two sides of a triangle must be greater than the length of the third side*\" is correct and is exactly what is needed. The three inequalities that stem from that are correct. But then the step-by-step reasoning says that \"1+3 > *x*\" is equivalent to \"*x* > −2\", which is wrong; it should be \"*x* < 4\". Now the model is stuck. It really needs to know that *x* < 4 to complete the proof. So it hallucinates the statement \"*the sum of any two sides of a triangle must be less than the third side*,\" which is ridiculous. (It reminds me of the scarecrow in the Wizard of Oz, who, upon receiving his diploma, proclaims \"*The sum of the square roots of any two sides of an isosceles triangle is equal to the square root of the remaining side*,\" which sounds smart, but is completely wrong.) The Minerva model somehow makes two errors cancel out by erroneously claiming that the erroneous inequality \"1+3 < *x*\" is equivalent to the needed inequality \"*x* < 4\". To summarize:\n",
    "\n",
    "|inequality|valid?|rearrangement|correct rearrangement?|\n",
    "|--|-|--|--|\n",
    "|1+3 > *x*|yes|*x* > −2|no, should be *x* < 4|\n",
    "|1+*x* > 3|yes|*x* > 2|yes|\n",
    "|*x*+3 > 1|yes|*x* > −2|yes|\n",
    "|1+3 < *x*|no|*x* < 4|no, should be *x* > 4|\n",
    "|1+*x* < 3|no|*x* < 2|yes|\n",
    "|*x*+3 < 1|no|*x* < −2|yes\n",
    "\n",
    "How did Minerva end up with the correct final answer but incorrect reasoning? I think an important contributor is majority voting. A generative model can produce different outputs on each run, depending on randomized choices, so Minerva is run multiple times, perhaps getting several different final answers, and then one of the runs that produces the plurality final answer is chosen to show the step-by-step reasoning. That means it is more likely that the final answer is correct, but the reasoning might be wrong.\n",
    "\n",
    "# Conclusions and speculations\n",
    "\n",
    "Generative models can produce impressive results, both for final answers and for step-by-step reasoning. They are improving rapidly; it seems that every month sees a new improved result. But current models could be improved. Here are some thoughts:\n",
    "- They are vulnerable to reproducing poor quality training data. (I suspect the `b.pop(0)` stems from this.)\n",
    "- They are good locally, but can have trouble keeping the focus all the way through a problem. (Stashing a character on the list `c` seemed like a good idea locally, but contributes nothing globally.)\n",
    "- They can hallucinate incorrect statements. This is a big issue in tasks like mathematics, where the small difference between the statements \"*x* < 4\" and \"*x* > 4\" makes a big difference to the outcome. In normal natural language, there is more redundancy and less chance for a single character difference to cause such big problems.\n",
    "- They need to be trained to provide trust. The AlphaCode model generates code, but does not generate documentation or tests that would build trust in the code.\n",
    "- The majority voting method is quick and easy, but incomplete. A better architecture would be to force consensus: if different runs produce different final answers, the system should have a way to reconcile the differences, figuring how and why the minority answers were generated, and making sure that mistakes in reasoning are not repeated in the majority answer.\n",
    "- The models should learn from interactions. Currently they are trained on a large corpus, then fine-tuned on a specific subject matter, and then run with appropriate prompts. If the prompt asks for step-by-step reasoning, the model can generate that, but then it doesn't learn anything from the process of solving the problem (whether it gets it right or wrong); every new problem posed to it is the same as the first problem. In the article [*Learning by Distilling Context*](https://arxiv.org/abs/2209.15189), the authors suggest an approach where a model is conditioned to predict the final answer and the step-by-step reasoning, given the problem description and the prompting instructions (such as \"show your reasoning step by step\"). The system is then fine-tuned to predict the final answer from the problem desscription, without seeing any prompting instructions or step-by-step reasoning. This approach has an interesting parallel to the [Dreyfus model of skill acquisition](https://www.bumc.bu.edu/facdev-medicine/files/2012/03/Dreyfus-skill-level.pdf), in which novices work by the rote application of rules. This works in routine situations, but the novice does not have a complete understanding of the contexts in which the rules will not apply. An expert uses their situational experience to arrive at a solution without the explicit application of rules. So the fine-tuning in this architecture can be seen as a process of building contextual arrangement and compiling step-by-step rules into immediate action.\n",
    "- The enoder-decoder transformer model was designed for dealing with natural language, for which we don't know the true grammar; exceptions are more common than rules; and the acceptability of sentences is subjective and varies from person to person, place to place, and time to time. But none of those things apply to formal languages such as Python. We know exactly what the rules for a valid program are, yet we don't have a good way of incorporating that knowledge into the transformer model.  Certainly we still need something like the transformer model, because we need to know that the variable name `i` usually references an integer, while the pair `(x, y)` often references a point in 2D space, and so on. These things are not mentioned in the formal grammar of Python. An approach that could combine the formal grammar rules and the learned transformer model would be welcome.\n",
    "- The eminent computer scientist Edsger Dijkstra predicted that machine learning, especially with gradient descent, could never be applied to programming, writing \"*In the discrete world of computing, there is no meaningful metric in which 'small' changes and 'small' effects go hand in hand, and there never will be.*\" Systems like AlphaCode have proven him partially wrong, but further progress would be easier if our programming languages were designed in such a way that the space of programs could be more easily explored by making small changes, and if it was faster to evaluate the quality of a program. Perhaps we'd be better off with functional languages that facilitate caching of intermediate results, so that when a small change is suggested, recomputing the program mostly uses precomputed results.\n",
    "- In modern software development many artifacts are produced. There's the code, but also documentation, test suites, design documents, performance timing results, user experience experiments and results, traces of user interactions, and so on. And then there's a machine learning model. We can optimize the machine learning model by feeding it inputs, examining the outputs, and modifying the model to minimize the loss between the expected and observed outputs. This is possible because the model is differentiable. But the machine learning model is just a small part of the overall software development process. If all the other parts could be incorporated into an end-to-end differntiable model, the process of evolving the system would be easier. Consider the scenario where the user experience researchers do an experiment comparring ten different user interfaces, and determine which one is best. The engineers then go implement that UI. Sometime later, the world changes: maybe the blend of users is different, maybe users migrate to devices with a different screen size. What would trigger an update to the UI? today, we rely on institutional memory: someone says, \"Hey, I remember that UX study a few years back; maybe we should look at it again and see if a different UI would be better.\" But if the experiment documents and everything else were all in an end-to-end model, then the model itself could detect when a change is warranted. Building languages that allow for the incorporation of all these different kinds of documents is a challenge for the future.\n",
    "- I'd prefer to find a way to train on complete software systems, with all the documentation, etc. Failing that, maybe we could find a better training set of shorter programs. I'm not a big fan of the programming contest training set, because I believe that programming contest problems have some unusual properties that don't hold for real-world problems. [Kevin Wang](https://blog.kevmo314.com/), a programming contest champion, told me a few of his tricks:\n",
    "  - \"*I save the most time by just observing that a problem is an adaptation of a common problem. For a problem like [2016 day 10](http://adventofcode.com/2016/day/10), it's just topological sort.*\" This suggests that the contest problems have a bias towards retrieving an existing solution (and adapting it) rather than synthesizing a new solution.\n",
    "  - \"*I think specifically for [AoC](https://adventofcode.com/) it's important to just read the input/output and skip all the instructions first. Especially for the first few days, you can guess what the problem is based on the sample input/output.*\" Kevin is saying that the input/output examples alone are sufficient to solve the easier problems; in the interest of speed he doesn't even read the problem description. I don't want to encourage a programming assistant that learns not to read the problem.\n",
    "  - \"*I also try to minimize the amount of code I write: each line of code is just another chance for a typo.*\" This is why AlphaCode learned to write code with one-letter variable names, and with no comments or docstrings.\n",
    "  - \"*Ultimately though it just comes down to a lot of practice and there's not really any secret tricks.*\" Again, this suggests Kevin is doing fast pattern recognition rather than slow case-by-case analysis and proof. That approach works great for programming contests, but maybe not as well for real-world problems.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
